Abstract

This paper derives a novel linear position constraint for cameras seeing a common
scene point, which leads to a direct linear method for global camera translation
estimation. Unlike previous solutions, this method deals with collinear camera motion and
weak image association at the same time. The final linear formulation does not involve
the coordinates of scene points, which makes it efficient even for large-scale data. We
solve the linear equation based on the $L_1$ norm, which makes our system more robust to
outliers in essential matrices and feature correspondences. We evaluate this method
on both sequentially captured images and unordered Internet images. The experiments
demonstrate its strength in robustness, accuracy, and efficiency.

1 Introduction

Structure-from-motion (SfM) algorithms aim to estimate scene structure and camera motion
from multiple images, and they can be broadly divided into incremental and global
methods according to how they register cameras. Incremental methods register cameras one
by one [S3, S3] or iteratively merge partial reconstructions [D3, EE]. These methods
require frequent intermediate bundle adjustment (BA) to ensure correct reconstruction, which
is computationally expensive. Yet, their results often suffer from large drifting errors. In
comparison, global methods (e.g. [HD, EE, E3, ED, EE, SE]) register all cameras
simultaneously, which has better potential in both efficiency and accuracy.

Global SfM methods often solve the camera orientations and positions separately. The
global position estimation is more challenging than the orientation estimation due to the
noisy pairwise translation encoded in essential matrices [□]. This paper focuses on the
problem of global position (i.e. translation) estimation.

Essential matrix based translation estimation methods [i, E, EE] can only determine
camera positions in a parallel rigid graph [EE], and they usually degenerate at collinear camera
motion because the translation scale is not determined by an essential matrix. Trifocal
tensor based methods [□, IZ3, E3, O] can deal with collinear motion, as the relative scales
are encoded in a trifocal tensor. However, these methods usually rely on a strongly
connected camera-triplet graph, where two triplets are connected by their common edge. The
3D reconstruction will distort or break into disconnected components when such strong
association among images does not exist. By solving cameras and scene points together, some
global methods [HD, El, ED, SO, S3] can deal with collinear motion. These methods usually
need to filter epipolar geometries (EGs) carefully to avoid outliers. Including scene points in
the formulation also hurts the scalability of the algorithm, since there are many more scene
points than cameras. The recent 1DSfM method [S3] designs a smart filter to discard
outlier essential matrices and solves scene points and cameras together by enforcing orientation
consistency. However, this method requires abundant association between input images, e.g.
$\sim O(n^2)$ essential matrices for $n$ cameras, which makes it more suitable for Internet
images; it often fails on sequentially captured data.

[Figure 1 panels: left, result from [IZ3]; right, our result.]

Figure 1: 1DSfM [S3] and triplet-based methods (e.g. [E3]) require strong association
among images. As shown on the left, they fail for images with weak association. In
comparison, as shown on the right, the results from our method do not suffer from such problems.

The data association problem of 1DSfM [S3] and triplet-based methods (here, we take
[E3] as an example) is exemplified in Figure 1. The Street example on the top is
sequential data. 1DSfM fails on this example due to insufficient image association, since each
image is matched to at most 4 neighbors. In the Seville example on the bottom,
the Internet images are mostly captured from two viewpoints (see the two representative
sample images) with weak affinity between images at different viewpoints. This weak data
association causes a seriously distorted reconstruction for the triplet-based method in [E3].

This paper introduces a direct linear algorithm to address the presented challenges. It
avoids degeneracy at collinear motion and deals with weakly associated data. Our method
capitalizes on constraints from essential matrices and feature tracks. For a scene point visible
in multiple (at least three) cameras, we consider triangles formed by this scene point and two
camera centers. We first generalize the camera-triplet based position constraint in [E3] to our
triangles with scene points. We then eliminate the scene point from these constraints. In this
way, we obtain a novel linear equation for the positions of cameras linked by a feature track.
Solving these linear equations from many feature tracks simultaneously registers all cameras
in a global coordinate system.

This direct linear method minimizes a geometric error, which is the Euclidean distance
from the scene point to its corresponding view rays. It is more robust than the linear
constraint developed in [SO], which minimizes an algebraic error. A key finding in this
paper is that a direct linear solution (without involving scene points) exists when minimizing
the point-to-ray error instead of the reprojection error. Since the point-to-ray error
approximates the reprojection error well when cameras are calibrated, our method is a good linear
initialization for the final nonlinear BA.

At the same time, this direct linear formulation lends us sophisticated optimization tools,
such as $L_1$ norm optimization [□, □, O, El]. We minimize the $L_1$ norm when solving the
linear equation of camera positions. In this way, our method can tolerate a larger amount of
outliers in both essential matrices and feature correspondences. The involved $L_1$ optimization
is nontrivial. We derive a linearization of the alternating direction method of multipliers
algorithm [□] to address it.


2 Related Work 

Incremental approaches. Most well-known SfM systems register cameras sequentially
[□, EB, E0, ID, El] or hierarchically [01, E3, EB] from pairwise relative motions. In order to
minimize error accumulation, frequent intermediate bundle adjustment is required for both
types of methods, which significantly reduces computational efficiency. The performance of
sequential methods relies heavily on the choice of the initial image pair and the order of
subsequent image additions [O].

Global rotation estimation. Global SfM methods solve all camera poses
simultaneously. Most of these methods take two steps. Typically, they solve camera orientations first
and then positions. The orientation estimation is well studied, with an elegant rotation
averaging algorithm presented in [ED]. The basic idea was first introduced by Govindu [UB], and
then developed in several following works [O, I2ZD, El]. In particular, [B] introduced a robust
$L_1$ method which was adopted in several recent works [EB, SB].

Global translation estimation. The translation estimation is more challenging. Some
pioneering works [i, □, DB, 03] solved camera positions solely from constraints in essential
matrices. Typically, they enforce consistency between pairwise camera translation directions
and those encoded in essential matrices. Recently, Ozyesil and Singer [EB] proved that
essential matrices only determine camera positions in a parallel rigid graph, and presented a convex
optimization algorithm to solve this problem. In general, all these essential matrix based
methods degenerate at collinear motion, where cameras are not in a parallel rigid graph.

This degeneracy can be avoided by exploiting relative motion constraints from camera
triplets [0, E2, O, S3], as the trifocal tensor encodes the relative scale information. Recently,
Jiang et al. [E3] derived a novel linear constraint in a camera triplet and solved all camera
positions in a least-squares sense. While triplet-based methods avoid degenerate camera
motion, they often require strong association among images, namely a connected triplet graph
where camera triplets are connected by common edges.

Some global methods estimate cameras and scene points together. Rother [EO] solved
camera positions and points by minimizing an algebraic error. Some works [□, El, El, ED, El]
solved the problem by minimizing the $L_\infty$ norm of the reprojection error. However, the
$L_\infty$ norm is known to be sensitive to outliers, and careful outlier removal is necessary [O, EB].
Recently, Wilson and Snavely [EB] directly solved cameras and points by Ceres Solver [01]
after applying a smart filter to essential matrices. Generally speaking, involving scene points
improves the robustness/accuracy of camera registration, but also significantly increases the
problem scale. Feature correspondence outliers also pose a significant challenge for these
methods.

Figure 2: (a) The positions of a scene point $p$ and two camera centers $c_i$ and $c_j$ satisfy a
linear constraint detailed in Section 3.1. (b) The positions of four cameras seeing the same
scene point satisfy a linear constraint detailed in Section 3.2.

Our method capitalizes on constraints in essential matrices and feature tracks. It avoids
degeneracy at collinear motion, handles weak data association, and is robust to feature
correspondence outliers.

3 Global Translation Estimation 

Given an essential matrix between two images $i, j$ (e.g. computed by the five-point algorithm
[m E3]), we obtain the relative rotation $R_{ij}$ and translation direction $t_{ij}$ between the two
cameras. Here, $R_{ij}$ is a 3 x 3 orthonormal matrix and $t_{ij}$ is a 3 x 1 unit vector. We
further denote the global orientation and position of the $i$-th ($1 \le i \le N$) camera as $R_i$ and $c_i$
respectively. These camera poses are constrained by the following equations

$$R_j = R_{ij} R_i, \qquad R_j (c_i - c_j) \simeq t_{ij}. \tag{1}$$

Here, $\simeq$ means equal up to a scale.

Like most global methods, we compute camera orientations first and solve camera
positions after that. We adopt the global rotation estimation method in [0]. In order to enhance
robustness, we adopt additional loop verifications [S3] on the input pairwise relative
camera rotations beforehand. Specifically, we chain the relative rotations along a three-camera
loop as $R = R_{ij} R_{jk} R_{ki}$, and compute the angular difference [O] between $R$ and the identity
matrix. If the difference is larger than a threshold $\varphi_1$ (3 or 5 degrees for sequential data
or unordered Internet data respectively), we consider the verification to have failed. We discard
an EG if every verification it participates in fails.
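The loop verification can be sketched as follows. This is a minimal illustration, assuming NumPy and 3 x 3 rotation matrices as input; the function names `rotation_angle_deg`, `loop_passes`, and the helper `rot_z` are hypothetical, and the multiplication order simply follows the formula in the text (the order that actually yields the identity depends on the rotation convention in use).

```python
import numpy as np

def rotation_angle_deg(R):
    """Angle (in degrees) of rotation R, i.e. its angular distance to the identity."""
    cos_theta = (np.trace(R) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] caused by round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def loop_passes(R_ij, R_jk, R_ki, threshold_deg=3.0):
    """Chain relative rotations around a three-camera loop and test the residual."""
    R_loop = R_ij @ R_jk @ R_ki
    return rotation_angle_deg(R_loop) <= threshold_deg

def rot_z(deg):
    """Hypothetical helper: rotation about the z axis, used only to build examples."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# A consistent loop (10 + 20 - 30 degrees) passes; a loop with a 6 degree residual fails.
consistent = loop_passes(rot_z(10), rot_z(20), rot_z(-30))
inconsistent = loop_passes(rot_z(10), rot_z(20), rot_z(-24))
```

An EG would then be discarded only if every loop it participates in fails this check, as described above.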

The key challenge in translation estimation is that an essential matrix does not tell the
scale of translation. We seek to obtain linear equations for those unknown scales without
resorting to camera-triplets. Our translation estimation is based on a linear position
constraint arising from a triangle formed by two camera positions and a scene point. With this
constraint, the positions of cameras linked by a feature point should satisfy a linear equation.

3.1 Constraints from a triangle 

A linear constraint on positions of cameras in a triplet is derived in [123]. We generalize it
to the case of triangles formed by a scene point and two cameras. As shown in Figure 2,
we compute the location of a scene point $p$ as the midpoint of the mutual perpendicular
line segment $AB$ of the two rays passing through $p$'s image projections. Specifically, it is
computed as

$$p = \frac{1}{2}(A + B) = \frac{1}{2}(c_i + s_i m_i + c_j + s_j m_j). \tag{2}$$

Here, $c_i$ and $c_j$ are the two camera centers. The two unit vectors $m_i$ and $m_j$ originate from the
camera centers and point toward the image projections of $p$. $s_i$ and $s_j$ are the distances from
the points $A$, $B$ to $c_i$, $c_j$ respectively, i.e. $A = c_i + s_i m_i$ and $B = c_j + s_j m_j$.

The rotation trick. The rotation trick used in [E3] shows that we can compute $m_i$ and
$m_j$ by rotating the relative translation direction $c_{ij}$ between $c_i$ and $c_j$, i.e. $m_i = R(\theta_i) c_{ij}$ and
$m_j = -R(\theta_j) c_{ij}$. Then Equation 2 becomes

$$p = \frac{1}{2}\left( c_i + s_i R(\theta_i) \frac{c_j - c_i}{\|c_j - c_i\|} + c_j + s_j R(\theta_j) \frac{c_i - c_j}{\|c_i - c_j\|} \right). \tag{3}$$

The two 3D rotation matrices $R(\theta_i)$ and $R(\theta_j)$ rotate the relative translation direction $c_{ij}$
to the directions $m_i$ and $m_j$. Both rotations can be computed easily in the local pairwise
reconstruction. In addition, the two ratios $s_i / \|c_j - c_i\|$ and $s_j / \|c_j - c_i\|$ can be computed
by the middle-point algorithm [ED]. Specifically, assuming unit baseline length, in the local
coordinate system attached to one of the cameras, $c_i$, $c_j$, $m_i$, and $m_j$ are all known. Thus, we
can solve $s_i$ and $s_j$ (they are actually the two ratios in Equation 3 for general baseline length)
from

$$(c_i + s_i m_i - c_j - s_j m_j) \times (m_i \times m_j) = 0. \tag{4}$$

Here, $\times$ is the cross product of vectors. Thus, Equation 3 becomes

$$2p = (A_j^{ij} - A_i^{ij})(c_i - c_j) + c_i + c_j, \tag{5}$$

where $A_i^{ij} = \frac{s_i}{\|c_j - c_i\|} R(\theta_i)$ and $A_j^{ij} = \frac{s_j}{\|c_i - c_j\|} R(\theta_j)$ are known matrices. This
equation provides a linear constraint among positions of two camera centers and a scene point.
Note that this linear constraint minimizes a geometric error, the point-to-ray distance.
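As a numerical sanity check of Equations 2 to 5, the following sketch triangulates a point with the midpoint rule and verifies the resulting linear constraint. It assumes NumPy; the function names are hypothetical, the perpendicularity conditions used to solve for $s_i, s_j$ are an equivalent restatement of Equation 4 for non-parallel rays, and the rotations $R(\theta_i)$, $R(\theta_j)$ are built here with the standard vector-to-vector rotation formula rather than taken from a pairwise reconstruction.

```python
import numpy as np

def midpoint_depths(c_i, c_j, m_i, m_j):
    """Solve for s_i, s_j such that A - B is perpendicular to both rays
    (equivalent to Equation 4 when the rays are not parallel)."""
    b = c_i - c_j
    M = np.array([[m_i @ m_i, -(m_i @ m_j)],
                  [m_i @ m_j, -(m_j @ m_j)]])
    rhs = -np.array([b @ m_i, b @ m_j])
    s_i, s_j = np.linalg.solve(M, rhs)
    return s_i, s_j

def rotation_between(a, b):
    """Rotation matrix taking unit vector a to unit vector b (requires a != -b)."""
    v = np.cross(a, b)
    c = float(a @ b)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

def triangle_constraint(c_i, c_j, m_i, m_j):
    """Return (A_i, A_j, p): the two matrices of Equation 5 and the midpoint p."""
    s_i, s_j = midpoint_depths(c_i, c_j, m_i, m_j)
    base = np.linalg.norm(c_j - c_i)
    c_ij = (c_j - c_i) / base
    A_i = s_i / base * rotation_between(c_ij, m_i)    # rotates c_ij onto m_i
    A_j = s_j / base * rotation_between(-c_ij, m_j)   # rotates -c_ij onto m_j
    p = 0.5 * (c_i + s_i * m_i + c_j + s_j * m_j)     # Equation 2
    return A_i, A_j, p

# Synthetic example: two cameras observing a known point with exact rays.
c_i = np.array([0.0, 0.0, 0.0])
c_j = np.array([2.0, 0.0, 0.0])
p_true = np.array([1.0, 0.5, 4.0])
m_i = (p_true - c_i) / np.linalg.norm(p_true - c_i)
m_j = (p_true - c_j) / np.linalg.norm(p_true - c_j)
A_i, A_j, p = triangle_constraint(c_i, c_j, m_i, m_j)
residual = 2 * p - ((A_j - A_i) @ (c_i - c_j) + c_i + c_j)  # Equation 5
```

With noise-free rays the midpoint coincides with the true point and the residual of Equation 5 vanishes up to round-off.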


3.2 Constraints from a feature track 

If the same scene point $p$ is visible in two image pairs $c_i, c_j$ and $c_k, c_l$ as shown in Figure 2
(b), we obtain two linear equations about $p$'s position according to Equation 5. We can
eliminate $p$ from these equations to obtain a linear constraint among four camera centers as
follows,

$$(A_j^{ij} - A_i^{ij})(c_i - c_j) + c_i + c_j = (A_l^{kl} - A_k^{kl})(c_k - c_l) + c_k + c_l. \tag{6}$$

Given a set of images, we build feature tracks and collect such linear equations from camera
pairs on the same feature track. Solving these equations will provide a linear global solution
of camera positions. To resolve the gauge ambiguity, we set the orthocenter of all cameras
at origin when solving these equations.
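To make the assembly of the linear system concrete, the sketch below moves all terms of Equation 6 to one side and writes the resulting 3 x 3 coefficient blocks into three rows of a dense matrix. It assumes NumPy; `equation6_rows` and `pair_matrix` are hypothetical names, and `pair_matrix` derives $A_j - A_i$ from synthetic ground-truth geometry purely for testing (in the method described here these matrices come from local pairwise reconstructions). A practical implementation would likely use a sparse matrix.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix taking unit vector a to unit vector b (requires a != -b)."""
    v = np.cross(a, b)
    K = np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + float(a @ b))

def pair_matrix(c_i, c_j, p):
    """Test stand-in: D = A_j - A_i for a pair whose rays meet exactly at p."""
    base = np.linalg.norm(c_j - c_i)
    c_ij = (c_j - c_i) / base
    s_i, s_j = np.linalg.norm(p - c_i), np.linalg.norm(p - c_j)
    A_i = s_i / base * rotation_between(c_ij, (p - c_i) / s_i)
    A_j = s_j / base * rotation_between(-c_ij, (p - c_j) / s_j)
    return A_j - A_i

def equation6_rows(n_cams, i, j, D_ij, k, l, D_kl):
    """Three rows of the system A x = 0 for one instance of Equation 6.
    x stacks the camera centers as [c_0, c_1, ..., c_{n-1}]."""
    rows = np.zeros((3, 3 * n_cams))
    I = np.eye(3)
    blocks = ((i, I + D_ij), (j, I - D_ij), (k, -(I + D_kl)), (l, -(I - D_kl)))
    for cam, block in blocks:
        # "+=" keeps the row correct when the two pairs share a camera.
        rows[:, 3 * cam:3 * cam + 3] += block
    return rows

# Synthetic check with nearly collinear cameras observing one scene point.
cams = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 0.5, 0.0], [6.0, 0.0, 1.0]])
p = np.array([3.0, 1.0, 8.0])
D01 = pair_matrix(cams[0], cams[1], p)
D23 = pair_matrix(cams[2], cams[3], p)
D12 = pair_matrix(cams[1], cams[2], p)
rows_a = equation6_rows(4, 0, 1, D01, 2, 3, D23)   # two disjoint pairs
rows_b = equation6_rows(4, 0, 1, D01, 1, 2, D12)   # pairs sharing camera 1
x = cams.ravel()
```

Both `rows_a @ x` and `rows_b @ x` vanish for the true camera positions, which is the property the global solver relies on.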

Equation 6 elegantly correlates the scales of translation (i.e. baseline lengths $\|c_i - c_j\|$ and
$\|c_k - c_l\|$) for camera pairs sharing a common scene point. For example, in the Seville data
in Figure 1 (bottom row), $c_i$, $c_j$ could come from one popular viewpoint of the building, and
$c_k$, $c_l$ could come from a different viewpoint. As long as there is a feature track linking them
together, Equation 6 provides constraints among the baseline lengths of these far apart
cameras. In comparison, triplet-based methods (e.g. [E3, E3, O]) can only propagate the scale
information over camera triplets sharing an edge. Clearly, the scale information can
propagate further along feature tracks. Therefore, this new formulation can reconstruct images
with weak association better than triplet-based methods.

3.3 Feature tracks selection 

Since there are usually abundant feature tracks to solve camera positions, we carefully 
choose the most reliable ones to enhance system robustness. For better feature matching 
quality, we only consider feature correspondences that are inliers of essential matrix fitting. 
We sort all feature tracks by their lengths in descending order, and then try to find a small 
set of tracks that could cover all connected cameras at least K times. (We set K = 30 in 
our experiments.) Please see our supplementary material for the pseudo-code of the feature 
tracks selection. 
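Since the pseudo-code lives in the supplementary material, the following is only one plausible reading of the selection rule, not the authors' procedure: a greedy pass over tracks sorted by length that keeps a track whenever it still contains a camera covered fewer than $K$ times. The function name `select_tracks` and the representation of a track as a set of camera ids are assumptions.

```python
def select_tracks(tracks, K=30):
    """Greedy cover: tracks is a list of sets of camera ids.
    Returns the indices of the selected tracks, longest first."""
    cover = {cam: 0 for track in tracks for cam in track}
    # Longest tracks first; Python's sort is stable, so ties keep input order.
    order = sorted(range(len(tracks)), key=lambda idx: len(tracks[idx]), reverse=True)
    selected = []
    for idx in order:
        # Keep the track only if it helps an under-covered camera.
        if any(cover[cam] < K for cam in tracks[idx]):
            selected.append(idx)
            for cam in tracks[idx]:
                cover[cam] += 1
    return selected

# Toy example with five tracks over four cameras.
toy_tracks = [{0, 1, 2}, {0, 1}, {2, 3}, {0, 1, 2, 3}, {1, 2}]
picked_k1 = select_tracks(toy_tracks, K=1)
picked_k2 = select_tracks(toy_tracks, K=2)
```

With `K=1` the single longest track already covers every camera; raising `K` pulls in shorter tracks until each camera is covered the requested number of times, or no remaining track helps.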

For a feature track with $N_t$ cameras, there are usually more than $N_t - 1$ EGs on it. So we
select the most reliable ones to construct equations. We consider the match graph formed
by these cameras, where two cameras are connected when their essential matrix is known.
Since we only consider feature correspondences passing essential matrix verification, this
graph has only one connected component. We weight each graph edge by $\frac{1}{M} + \alpha \frac{1}{\theta}$, where $M$
is the number of feature matches between two images, and $\theta$ is the triangulation angle. The
combination weight $\alpha$ is fixed at 0.1 in our experiments. We take the minimum spanning
tree of this graph, and randomly choose two edges from the tree to build a linear equation
until each edge is used twice.
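The edge weighting and spanning-tree step can be sketched with a plain Kruskal implementation. The names `edge_weight` and `minimum_spanning_tree` are hypothetical, the weight assumes the form $1/M + \alpha/\theta$ read from the text above, and the unit of the triangulation angle (degrees here) is an assumption since the text does not state it.

```python
def edge_weight(num_matches, tri_angle_deg, alpha=0.1):
    """Smaller weight = more reliable edge (many matches, wide triangulation angle)."""
    return 1.0 / num_matches + alpha / tri_angle_deg

def minimum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm. edges is a list of (weight, u, v); returns a list of (u, v)."""
    parent = list(range(num_nodes))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for _, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

# Toy match graph on one feature track with four cameras:
# (number of matches, triangulation angle in degrees, camera u, camera v)
raw = [(200, 10.0, 0, 1), (150, 8.0, 1, 2), (20, 2.0, 0, 2),
       (100, 5.0, 2, 3), (30, 1.0, 1, 3)]
edges = [(edge_weight(m, ang), u, v) for m, ang, u, v in raw]
tree = minimum_spanning_tree(4, edges)
```

Pairs of edges drawn from `tree` would then each yield one instance of Equation 6.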


4 Robust Estimation by $L_1$ Norm

Our linear global method requires solving a linear system of the form $Ax = 0$ to estimate camera
centers. Here, $x$ is the unknown vector formed by concatenating all camera positions, and
$A$ is the coefficient matrix formed by collecting Equation 6 from feature tracks.

The 3D reconstruction process is noisy and involves many outliers, both in essential
matrices and feature correspondences. We enhance system robustness by minimizing the $L_1$
norm, instead of the conventional $L_2$ norm. In other words, we solve the following
optimization problem,

$$\arg\min_x \|Ax\|_1, \quad \text{s.t. } x^\top x = 1. \tag{7}$$

This problem might be solved by iterative reweighted total least squares, which is often slow
and requires good initialization. Recently, Ferraz et al. [O] proposed a robust method to
discard outliers, but it is not applicable to our large sparse homogeneous system. We
capitalize on the recent alternating direction method of multipliers (ADMM) [□] for
better efficiency and a larger convergence region. Due to the quadratic constraint, i.e. $x^\top x = 1$,
the original ADMM algorithm cannot be directly applied to our problem. We linearize the
optimization problem in the inner loop of ADMM to solve Equation 7.

Let $e = Ax$; the augmented Lagrangian function of Equation 7 is

$$L(e, x, \lambda) = \|e\|_1 + \langle \lambda, Ax - e \rangle + \frac{\beta}{2}\|Ax - e\|^2, \quad \text{s.t. } x^\top x = 1, \tag{8}$$

where $\lambda$ is the Lagrange multiplier, $\langle \cdot, \cdot \rangle$ is the inner product, and $\beta > 0$ is a parameter
controlling the relaxation.
| Dataset, $c_{err}$ (GT, mm) | Rother[ED] | Jiang [E3] | Moulon[ID] | Arie[i] | VisualSFM[E3] | 1DSfM[EB] | $L_2$ | $L_1$ |
|---|---|---|---|---|---|---|---|---|
| fountain-P11  | 2.5   | 3.1  | 2.5  | 2.9 | 3.6  | 33.5 | 2.5  | 2.5  |
| Herz-Jesu-P25 | 5.0   | 7.5  | 5.3  | 5.3 | 5.7  | 36.3 | 5.0  | 5.0  |
| castle-P30    | 347.0 | 72.4 | 21.9 | -   | 70.7 | -    | 21.6 | 21.2 |

Table 1: Reconstruction accuracy comparison on benchmark data with ground truth (GT)
camera intrinsics.


| Dataset, $c_{err}$ (EXIF, mm) | Rother[E0] | Jiang [123] | Arie[i] | VisualSFM[0] | 1DSfM[E3] | $L_2$ | $L_1$ |
|---|---|---|---|---|---|---|---|
| fountain-P11  | 23.3   | 14.0  | 22.6 | 20.7  | 32.2 | 6.9   | 7.0   |
| Herz-Jesu-P25 | 49.5   | 64.0  | 47.9 | 45.3  | 64.9 | 25.5  | 26.2  |
| castle-P30    | 2651.8 | 235.0 | -    | 190.1 | -    | 317.3 | 166.7 |

Table 2: Reconstruction accuracy comparison on benchmark data with approximate
intrinsics from EXIF. The results by Moulon[E2] are not available.


We then iteratively optimize $e$, $x$, and $\lambda$ in Equation 8. In each iteration, we update $e_{k+1}$,
$x_{k+1}$, $\lambda_{k+1}$ according to the following scheme,

$$e_{k+1} = \arg\min_e L(e, x_k, \lambda_k) = \arg\min_e \|e\|_1 + \langle \lambda_k, Ax_k - e \rangle + \frac{\beta}{2}\|Ax_k - e\|^2, \tag{9}$$

$$x_{k+1} = \arg\min_{x \in \Omega} L(e_{k+1}, x, \lambda_k) = \arg\min_{x \in \Omega} \langle \lambda_k, Ax - e_{k+1} \rangle + \frac{\beta}{2}\|Ax - e_{k+1}\|^2, \tag{10}$$

$$\lambda_{k+1} = \lambda_k + \beta (Ax_{k+1} - e_{k+1}), \tag{11}$$

where $\Omega := \{x^\top x = 1 \mid x \in \mathbb{R}^n\}$. A closed-form solution [ED] exists for the minimization of
Equation 9. (Please see Appendix A for the formula.) Solving Equation 10 is hard because of
the quadratic constraint on $x$. Therefore, we linearize Equation 10 and derive a closed-form
solution as

$$x_{k+1} = C / \|C\|_2, \tag{12}$$

where $C = x_k - \frac{1}{\eta} A^\top (Ax_k - e_{k+1}) - \frac{1}{\beta\eta} A^\top \lambda_k$, and $\eta > \sigma_{\max}(A^\top A)$. (Please see Appendix
B for more details.) In order to speed up convergence [ED], we adopt a dynamic parameter $\beta$
as

$$\beta_{k+1} = \min\{\beta_{\max}, \rho \beta_k\}, \tag{13}$$

where $\rho > 1$. We set $\rho$ as 1.01 or 1.1 for sequential data and Internet data respectively in our
experiments. Algorithm 1 summarizes our linearized ADMM algorithm.


Algorithm 1 Our linearized ADMM for Equation 7.

1: Initialize: Set $x_0$ to the $L_2$ solution (i.e. the eigenvector of $A^\top A$ with the smallest
eigenvalue), $e_0 = 0$, $\lambda_0 = 0$, $\beta_0 = 10^{-6}$;
2: while not converged, do
3: Step 1: Update $e$ by solving Equation 9;
4: Step 2: Update $x$ by solving Equation 12;
5: Step 3: Update $\lambda$ by solving Equation 11;
6: Step 4: Update $\beta$ by solving Equation 13;
7: end while
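A compact NumPy sketch of Algorithm 1 is given below. It follows Equations 9, 11, 12, and 13 as written, but several details are assumptions rather than values from the text: the cap `beta_max`, the iteration limit, the stopping rule, the factor 1.01 used to pick $\eta$ just above $\sigma_{\max}(A^\top A)$, and the use of a dense SVD for the $L_2$ initialization (which presumes at least as many rows as columns). The function names are hypothetical.

```python
import numpy as np

def soft_threshold(u, eps):
    """Element-wise shrinkage: closed-form solution of Equation 9."""
    return np.sign(u) * np.maximum(np.abs(u) - eps, 0.0)

def linearized_admm_l1(A, beta0=1e-6, beta_max=1e10, rho=1.1,
                       max_iter=5000, tol=1e-9):
    """Approximately minimize ||A x||_1 subject to ||x||_2 = 1."""
    # L2 initialization: right singular vector of the smallest singular value,
    # i.e. the eigenvector of A^T A with the smallest eigenvalue.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    x = Vt[-1]
    e = np.zeros(A.shape[0])
    lam = np.zeros(A.shape[0])
    beta = beta0
    # Proximal parameter: slightly above sigma_max(A^T A) = ||A||_2^2.
    eta = 1.01 * np.linalg.norm(A, 2) ** 2
    for _ in range(max_iter):
        e = soft_threshold(A @ x + lam / beta, 1.0 / beta)            # Step 1
        C = x - A.T @ (A @ x - e) / eta - A.T @ lam / (beta * eta)    # Step 2
        x_new = C / np.linalg.norm(C)
        lam = lam + beta * (A @ x_new - e)                            # Step 3
        beta = min(beta_max, rho * beta)                              # Step 4
        converged = (np.linalg.norm(x_new - x) < tol
                     and np.linalg.norm(A @ x_new - e) < tol)
        x = x_new
        if converged:
            break
    return x

# Toy system: every row of A_clean is orthogonal to x_true, so A_clean x_true = 0.
rng = np.random.default_rng(0)
x_true = rng.normal(size=6)
x_true /= np.linalg.norm(x_true)
raw_rows = rng.normal(size=(30, 6))
A_clean = raw_rows - np.outer(raw_rows @ x_true, x_true)
x_clean = linearized_admm_l1(A_clean)

# Same system with three rows replaced by outliers.
A_noisy = A_clean.copy()
A_noisy[:3] = rng.normal(size=(3, 6))
x_noisy = linearized_admm_l1(A_noisy)
```

On the noise-free system the $L_2$ initialization is already a fixed point of the iteration, so the solver returns $\pm$`x_true`. The result is only defined up to sign, as for any homogeneous system.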
[Figure 3 panels: (a) Building, (b) Street, (c) Park, (d) Stair.]

Figure 3: Evaluation on sequential data. From top to bottom, each row shows sample
input images, 3D reconstructions generated by our method, VisualSFM [S3], and the least
unsquared deviations (LUD) method [E3] respectively.


5 Experiments 


5.1 Evaluation on benchmark data 

We compare our method with VisualSFM [S3], and several global SfM methods on the
benchmark data provided in [S3]. We use both ground truth camera intrinsics and
approximate intrinsics from EXIF in the experiment. We implemented the method in [SD] ourselves.
The results of VisualSFM [S3] and 1DSfM [S3] are obtained by running the codes provided
by the authors. To evaluate the $L_1$ norm optimization, we also test the conventional
$L_2$ norm optimization in place of the $L_1$ norm in Equation 7. The results are indicated as $L_1$
and $L_2$ respectively.

We summarize all the results in Table 1 and Table 2. All results are evaluated after the
final bundle adjustment. Our method generally produces the smallest errors with either ground
truth intrinsics or approximate ones from EXIF. The $L_2$ and $L_1$ methods produce similar
results on the fountain-P11 and Herz-Jesu-P25 data, since these data have few outliers. But the
$L_1$ method outperforms $L_2$ significantly on the castle-P30 data, whose essential matrix
estimation suffers from repetitive scene structures. The noisy epipolar geometries also cause bad
performance of the method in [EO] on the castle-P30 data, which solves cameras and scene
points together by minimizing an algebraic error. In comparison, our method minimizes a
geometric error, which is the point-to-ray distance, and achieves better robustness.


5.2 Experiment on sequential data 

Figure 3 summarizes our experiment on sequentially captured data and compares our method
with an incremental method, VisualSFM [S3], and some recent global methods [EB, 13].
The test data Building, Street, Park and Stair have 128, 168, 507 and 1221 input images
respectively. Generally speaking, VisualSFM [S3] suffers from large drifting errors when
the input essential matrices are noisy or when the image sequence is long, as shown in
the third row of Figure 3. The drifting errors in the Park and Stair examples are severe
because of the poor essential matrix estimation due to poor feature localization in their tree
images. We do not include results from 1DSfM [E3] in Figure 3, since it fails on all these
sequential data. Its result on the Street data is shown in Figure 1 (left of top row). 1DSfM
cannot handle these examples because it is designed for Internet images, which tend to have
$O(n^2)$ essential matrices for $n$ images$^1$. The least unsquared deviations (LUD) method [EZ3]
generates distortion on the Street example, because it degenerates at collinear motion. In
comparison, our method does not have visible drifting and is robust to collinear motion and
weak image association.

| Dataset | | 1DSfM [ED] | | | | LUD[E3] | | | | $L_2$ | | $L_1$ | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Name | $N_i$ | $N_c$ | $\tilde{x}$ | $\tilde{x}_{BA}$ | $\bar{x}_{BA}$ | $N_c$ | $\tilde{x}$ | $\tilde{x}_{BA}$ | $\bar{x}_{BA}$ | $N_c$ | $\tilde{x}$ | $N_c$ | $\tilde{x}$ | $\tilde{x}_{BA}$ | $\bar{x}_{BA}$ |
| Alamo             | 613 | 529 | 1.1 | 0.3 | 2e7 | 547 | 0.4 | 0.3 | 2.0 | 496 | 0.5 | 500 | 0.6 | 0.5 | 3.7 |
| Ellis Island      | 242 | 214 | 3.7 | 0.3 | 3.0 | -   | -   | -   | -   | 183 | 9.4 | 211 | 3.1 | 0.6 | 1.8 |
| Montreal N.D.     | 467 | 427 | 2.5 | 0.4 | 1.0 | 435 | 0.5 | 0.4 | 1.0 | 424 | 0.8 | 426 | 0.8 | 0.4 | 1.1 |
| Notre Dame        | 552 | 507 | 10  | 1.9 | 7.0 | 536 | 0.3 | 0.2 | 0.7 | 537 | 0.3 | 539 | 0.3 | 0.2 | 0.8 |
| NYC Library       | 369 | 295 | 2.5 | 0.4 | 1.0 | 320 | 2.0 | 1.4 | 7.0 | -   | -   | 288 | 1.4 | 0.9 | 6.9 |
| Piazza del Popolo | 350 | 308 | 3.1 | 2.2 | 2e2 | 305 | 1.5 | 1.0 | 4.0 | 302 | 3.6 | 294 | 2.6 | 2.4 | 3.2 |
| Tower of London   | 499 | 414 | 11  | 1.0 | 40  | 425 | 4.7 | 3.3 | 10  | 311 | 17  | 393 | 4.4 | 1.1 | 6.2 |
| Vienna Cathedral  | 897 | 770 | 6.6 | 0.4 | 2e4 | 750 | 5.4 | 4.4 | 10  | 574 | 3.6 | 578 | 3.5 | 2.6 | 4.0 |
| Yorkminster       | 450 | 401 | 3.4 | 0.1 | 5e2 | 404 | 2.7 | 1.3 | 4.0 | 333 | 3.9 | 341 | 3.7 | 3.8 | 14  |

Table 3: Comparison with [S3] on challenging data. $N_i$ denotes the number of cameras in the
largest connected component of our EG graph, and $N_c$ denotes the number of reconstructed
cameras. $\tilde{x}$ denotes the median error before BA. $\tilde{x}_{BA}$ and $\bar{x}_{BA}$ denote the median error and the
average error after BA respectively. The errors are the distances in meters to corresponding
cameras computed by an incremental SfM method [S3].

| Dataset | 1DSfM[EB] $T_{BA}$ | 1DSfM[EB] $\Sigma$ | LUD[E3] $T_{BA}$ | LUD[E3] $\Sigma$ | Bundler [!□] $\Sigma$ | $L_1$ $T_{BA}$ | $L_1$ $\Sigma$ |
|---|---|---|---|---|---|---|---|
| Alamo             | 752  | 910  | 133 | 750  | 1654  | 362 | 621  |
| Ellis Island      | 139  | 171  | -   | -    | 1191  | 64  | 95   |
| Montreal N.D.     | 1135 | 1249 | 167 | 553  | 2710  | 226 | 351  |
| Notre Dame        | 1445 | 1599 | 126 | 1047 | 6154  | 793 | 1159 |
| NYC Library       | 392  | 468  | 54  | 200  | 3807  | 48  | 90   |
| Piazza del Popolo | 191  | 249  | 31  | 162  | 1287  | 93  | 144  |
| Tower of London   | 606  | 648  | 86  | 228  | 1900  | 121 | 221  |
| Vienna Cathedral  | 2837 | 3139 | 208 | 1467 | 10276 | 717 | 959  |
| Yorkminster       | 111  | 899  | 148 | 297  | 3225  | 63  | 108  |

Table 4: Running times in seconds for the Internet data. $T_{BA}$ and $\Sigma$ denote the final bundle
adjustment time and total running time respectively.


5.3 Experiment on unordered Internet data 

The input epipolar geometries for Internet data are quite noisy because of the poor feature
matching. Besides using $L_1$ optimization, we adopt two additional steps to improve the
robustness of our method. After solving camera orientations by the method in [B], we further
filter input essential matrices with the computed camera orientations. Specifically, for each
camera pair, we compare the relative rotation derived from their global orientations with the relative
rotation encoded in their essential matrix. If the difference is larger than a threshold $\varphi_2$ (set
to 5 or 10 degrees), we discard that essential matrix. Furthermore, we refine the relative
translations with the camera orientations fixed using the method in [123].

We test our method on the challenging Internet data released by [S3] and compare our
method with several global methods. We use the results of an optimized incremental SfM
system based on Bundler [S3] as the reference 'ground-truth' and compute the camera
position errors for evaluation. As shown in Table 3, our method with $L_1$ optimization has smaller
initial median errors than [S3] and comparable errors with [E3]. Our method with $L_1$
optimization performs better than the $L_2$ solutions, which shows the effectiveness of the proposed
$L_1$ method. All the methods have similar results after the final bundle adjustment.

$^1$ This is according to our discussion with the authors of 1DSfM [E3].

Table 4 lists the running time of different methods. All our experiments were run on a 
machine with two 2.4GHz Intel Xeon E5645 processors with 16 threads enabled. We cite the 
running time for [S3], [EB] and [S3] for comparison. Our method is around 10 times faster 
than the optimized incremental method [S3] and also faster than the global methods [EB, SB]. 


6 Conclusion 

We derive a novel linear method for global camera translation estimation. This method is
based on a novel position constraint on cameras linked by a feature track, which minimizes
a geometric error and propagates the scale information across far apart camera pairs. In this
way, our method works well even on weakly associated images. The final linear formulation
does not involve coordinates of scene points, so it is easily scalable and computationally
efficient. We further develop an $L_1$ optimization method to make the solution robust to
outlier essential matrices and feature correspondences. Experiments on various data and
comparison with recent works demonstrate the effectiveness of this new algorithm.

Acknowledgements. This work is supported by the NSERC Discovery grant 611664,
Discovery Acceleration Supplements 611663, and the HCCS research grant at the ADSC from
Singapore's Agency for Science, Technology and Research (A*STAR).


Appendix 


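A. Solution for Equation 9. From Equation 9, we have

$$e_{k+1} = \arg\min_e \frac{1}{\beta}\|e\|_1 + \frac{1}{2}\left\|Ax_k - e + \frac{\lambda_k}{\beta}\right\|^2 = \arg\min_e \varepsilon \|e\|_1 + \frac{1}{2}\|e - u\|^2, \tag{14}$$

where $\varepsilon = \frac{1}{\beta}$, and $u = Ax_k + \frac{\lambda_k}{\beta}$. According to [IZ9], the solution for Equation 14 is

$$e^i_{k+1} = \begin{cases} u^i - \varepsilon, & \text{if } u^i > \varepsilon, \\ u^i + \varepsilon, & \text{if } u^i < -\varepsilon, \\ 0, & \text{otherwise}, \end{cases} \tag{15}$$

where $e^i_{k+1}$ and $u^i$ are the $i$-th elements of $e_{k+1}$ and $u$.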
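The shrinkage rule of Equation 15 acts on each element independently, so it can be checked on scalars. The sketch below, with hypothetical function names, implements the rule and compares it against a brute-force search over a grid for the scalar version of the objective in Equation 14.

```python
def soft_threshold(u, eps):
    """Scalar form of Equation 15."""
    if u > eps:
        return u - eps
    if u < -eps:
        return u + eps
    return 0.0

def objective(e, u, eps):
    """Scalar form of the objective in Equation 14."""
    return eps * abs(e) + 0.5 * (e - u) ** 2

def is_grid_minimizer(u, eps, lo=-5.0, hi=5.0, steps=2001):
    """True if no grid candidate in [lo, hi] beats the closed-form solution."""
    best = objective(soft_threshold(u, eps), u, eps)
    step = (hi - lo) / (steps - 1)
    return all(best <= objective(lo + n * step, u, eps) + 1e-12 for n in range(steps))

# Check several inputs, including values inside the dead zone |u| <= eps.
checks = [is_grid_minimizer(u, 1.0) for u in (-3.0, -1.0, -0.2, 0.0, 0.5, 1.0, 2.5)]
```

Values of $u$ with $|u| \le \varepsilon$ are mapped to zero, which is how the $L_1$ penalty suppresses small residuals while leaving large (outlier) residuals only mildly shrunk.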


B. Derivation of Equation 12. We linearize the quadratic term $\frac{\beta}{2}\|Ax - e_{k+1}\|^2$ in
Equation 10 at $x_k$, which gives

$$x_{k+1} = \arg\min_{x \in \Omega} \langle A^\top \lambda_k, x \rangle + \beta \langle A^\top (Ax_k - e_{k+1}), x - x_k \rangle + \frac{\beta\eta}{2}\|x - x_k\|^2 = \arg\min_{x \in \Omega} \frac{\beta\eta}{2}\|x - C\|^2, \tag{16}$$

where $\Omega := \{x^\top x = 1 \mid x \in \mathbb{R}^n\}$, $C = x_k - \frac{1}{\eta} A^\top (Ax_k - e_{k+1}) - \frac{1}{\beta\eta} A^\top \lambda_k$, and $\eta > \sigma_{\max}(A^\top A)$
is a proximal parameter. Therefore, we can get Equation 12 directly.